The automated Galaxy-SynBioCAD pipeline for synthetic biology design and engineering

Here we introduce the Galaxy-SynBioCAD portal, a toolshed for synthetic biology, metabolic engineering, and industrial biotechnology. The tools and workflows currently shared on the portal enable one to build libraries of strains producing desired chemical targets. They cover an end-to-end metabolic pathway design and engineering process, from the selection of strains and targets, through the design of DNA parts to be assembled, to the generation of scripts driving liquid handlers for plasmid assembly and strain transformations. Standard formats like SBML and SBOL are used throughout to enforce the compatibility of the tools. In a study carried out at four different sites, we illustrate the link between pathway design and engineering with the building of a library of E. coli lycopene-producing strains. We also benchmark our workflows on literature and expert-validated pathways. Overall, we find an 83% success rate in retrieving the validated pathways among the top 10 pathways generated by the workflows.


Table of Contents
Galaxy-SynBioCAD tools
Table S1. Galaxy-SynBioCAD tools
Tools design and integration process
Figure S1. Development and integration process.
Publishing a tool in the Galaxy ToolShed
Literature Pathways matching algorithm
Reaction matching score
Pathway matching score
Matching Threshold
Figure S9: Score distribution of all predicted pathways.
Supplementary References

[…] recognized with the tool's internal database) or by decomposing the chemical structure and calculating its formation energy using the component contribution method. Thereafter, the species involved in a reaction are combined (with consideration for stoichiometry) and the thermodynamic feasibility of the pathway is estimated by taking the sum of the Gibbs free energy of each participating reaction. The node takes as input pathways in SBML format and returns annotated pathways (with thermodynamics information for each reaction, see Methods section for further details) also in SBML format.
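The summation described above can be illustrated with a minimal Python sketch. It is not the tool's implementation: the formation energies are hypothetical placeholder values (the real ones come from the internal database or the component contribution method), and the names `FORMATION_ENERGY`, `reaction_gibbs`, and `pathway_gibbs` are introduced here for illustration only.

```python
from typing import Dict, List

# Hypothetical formation energies in kJ/mol (placeholder values, not real data).
FORMATION_ENERGY: Dict[str, float] = {"A": -100.0, "B": -150.0, "C": -220.0, "D": -60.0}

# A reaction maps each species to its stoichiometric coefficient
# (negative for substrates, positive for products).
Reaction = Dict[str, float]


def reaction_gibbs(reaction: Reaction, formation: Dict[str, float]) -> float:
    """Combine species formation energies, weighted by stoichiometry."""
    return sum(coeff * formation[species] for species, coeff in reaction.items())


def pathway_gibbs(pathway: List[Reaction], formation: Dict[str, float]) -> float:
    """Estimate pathway feasibility as the sum of its reaction Gibbs energies."""
    return sum(reaction_gibbs(reaction, formation) for reaction in pathway)


# Hypothetical two-step pathway: A -> B, then B + D -> C.
pathway: List[Reaction] = [
    {"A": -1.0, "B": 1.0},
    {"B": -1.0, "D": -1.0, "C": 1.0},
]
total = pathway_gibbs(pathway, FORMATION_ENERGY)
```

Under these placeholder values, a negative total would suggest that the pathway as a whole is thermodynamically favorable.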
rpFBA is used to calculate target production fluxes of the designed pathways. To perform Flux Balance Analysis (FBA) on a heterologous pathway, this tool first merges a heterologous pathway with a user-specified GEM model. The GEM models should be in the SBML format such as those available in the BiGG 4 and MetaNetX 5 databases. When merging the GEM model with heterologous pathways, compounds that cannot carry any flux are temporarily removed from the reaction for the FBA evaluation. Such cases can happen due to side substrates or products of predicted reactions that do not match any chassis compound. This enables FBA to consider whole-cell conditions for the theoretical production of the user's target molecule. The tool uses the COBRApy package to perform FBA 6 . The native COBRApy methods supported are FBA and parsimonious FBA (pFBA). The tool also contains an in-house developed method ("fraction of reaction" described in section Methods) to consider the potential burden that the production of a target molecule may have on the cell and the impact of the target itself. The node takes as input pathways like those produced by rp2Path and a strain model both in SBML format and returns annotated pathways (with calculated fluxes, see Methods section) in SBML format.
rpScore provides a global score for a given pathway. This score is computed by a machine learning (ML) model (cf. Machine Learning Global Scoring in main text). The model takes as input features describing the pathway (thermodynamic feasibility, target flux with fixed biomass, length) and the reactions within the pathway (reaction SMARTS, Gibbs free energy, enzyme availability score) and outputs the probability that the pathway is valid. The ML model has been trained on literature data (cf. section Benchmarking with literature data in main text) and by a validation trial (cf. section Benchmarking by expert validation trial in main text). Based on this global pathway score, a dedicated Galaxy wrapper (rpRanker) ranks a set of heterologous pathways to reveal which pathways are the most likely to produce the target molecule in an organism of choice.
rpReport generates HTML pages to visualize the main characteristics of the predicted pathways. Predicted pathways are summarized in a table, providing a quick overview of important characteristics, namely thermodynamics (∆G'm), fluxes (FBA for target production), the number of metabolic steps, reaction rule score (inherited from RetroRules rules), and the global score. Selecting one pathway shows information on its individual reactions using bar plots, and lists of EC numbers with crosslinks. Selecting several pathways from the table renders bar plots that compare these pathways. HTML reports are displayed using a responsive layout, can be exported, and do not require an internet connection to be explored. The rpReport node takes as input one or several pathway files in SBML format (.xml files), merged as a tar archive or in a folder. SBML files are parsed using Python, and the HTML reports involve JavaScript code.
rpViz provides an interactive web interface for exploring predicted pathways and their associated annotations. The tool extracts information from the predicted pathways described as SBML files, and produces an HTML web page. The web page relies on JavaScript code and is a "dependency-free" output that is easy for the user to set up locally. Possible user interactions are pathway highlighting, cofactor handling, and the viewing of information at the levels of pathways, reactions, and involved compounds. The node takes as input pathways in SBML format.
Selenzyme 7 is an open-source tool that performs enzyme sequence selection from a reaction query. The tool can be queried using a reaction template such as the reaction rules in RetroRules. This feature makes this tool especially useful in combination with RetroPath2.0. Selenzyme performs a reaction similarity search in the reference reaction database MetaNetX 5 and outputs the sequences annotated for the closest reactions. The tool provides several scores that can be combined in order to define an overall score. Scores are given for reaction similarity, conservation based on a multiple sequence alignment of the result, phylogenetic distance between source organism and host, and additional scores calculated from sequence properties. Selenzyme takes as input pathways in SBML format and returns annotated pathways (with UniProt IDs for each reaction, see Methods section) also in SBML format.
SbmlToSbol provides the mapping from the theoretical space to the practical space. This tool takes a pathway model (encoded in SBML) as input and returns a collection of placeholders (encoded in SBOL) for the subsequent design of the synthetic DNA that is required to encode the enzymes defined in the pathway model. The converter first parses the SBML model and extracts a user-specified number of homologous enzymes for each metabolic reaction. Synthetic gene design templates, in the form of SBOL ComponentDefinitions, are generated for each enzyme. Each template consists of an (enzyme) coding region (specified by a UniProt sequence identifier), 5' and 3' flanking regions for downstream assembly, and, optionally, ribosome binding sites of user-specified translation initiation rates, allowing for the control of translational regulation. The SBOL document contains no sequence data but acts as a template to be passed on to the next node, PartsGenie.
PartsGenie is an established web application for the design of reusable synthetic DNA parts 8 . It supports the integrated design and optimization of ribosome binding sites, coding sequences, and other features, providing a multi-objective optimization algorithm that simultaneously optimizes translation initiation rate and codon usage along with elimination of repeating nucleotides and unwanted restriction sites. Furthermore, PartsGenie also implements guidelines from DNA manufacturers to optimize sequences for synthesisability, including the reduction of both local and global GC content. PartsGenie takes in the "template" SBOL document from the preceding SbmlToSbol converter step as input, and uses this as a set of instructions to design and optimize synthetic DNA sequences for each gene in the template. The SBOL document is updated with these novel sequences. As PartsGenie is a REST service, a client has been developed to make requests.
OptDoE is based on the optimal design of experiments OptBioDes library 9 to combine selected genetic parts (from PartsGenie) and enzyme variants for the desired pathways from SBML files provided by Selenzyme. The D-optimal experimental design algorithm is based on a logistic regression analysis with an assumed linear model for the response, and is evaluated based on its D-efficiency, which compares the design with an orthogonal design. The OptDoE node accepts as input the pathways in SBML format annotated with the enzyme variants and the collection of genetic parts consisting of plasmid copy numbers of the vector backbone, resistance cassette, promoters, and terminator in SBOL format and registered in the SynBioHub repository.
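For orientation, a commonly used textbook definition of D-efficiency is given below. Whether OptBioDes uses exactly this normalization is an assumption; the symbols X (the N × p model matrix with factors coded to the range [-1, 1]), N (number of runs), and p (number of model parameters) are introduced here and do not appear in the original text.

```latex
D_{\mathrm{eff}} = 100 \times \frac{\left| X^{\top} X \right|^{1/p}}{N}
```

For an orthogonal design, X^T X = N I, so the determinant term equals N and the efficiency reaches 100%. This is the sense in which the measure compares a design with an orthogonal one.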
DNA Weaver 10 devises cloning strategies using either Golden Gate Assembly or Gibson Assembly to obtain plasmids for each combination of genetic parts selected by the OptDoE node. As both assembly methods have practical limitations, DNA Weaver first considers Golden Gate Assembly using the type-IIS enzymes BsmBI, BsaI, or BbsI (in this order) and falls back to Gibson Assembly, although this order of preference can be changed by the user. The resulting assembly strategies produce minimally "scarless" plasmids whose sequence is the direct concatenation of the sequences of the plasmid's parts, with the Golden Gate overlap sequences if this method is used. The node output is a spreadsheet featuring a list of all the primers required to extend the standard genetic parts with sequence homologies necessary for the assembly and a list of all PCRs and fragment assembly operations required to obtain the desired plasmids. The assembly strategy is optimized to maximize primer reuse between constructs and optimize assembly homologies, via the DNA Weaver framework 10 . A specific tool (DNAWeaver_SynbioCAD) has been written for Galaxy-SynBioCAD; it takes SBOL designs produced by the Galaxy-SynBioCAD pipeline and devises assembly plans based on assembly options, parameters, and logic that are specific to Galaxy-SynBioCAD.
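The order of preference described above can be sketched in a few lines of Python. This is not DNA Weaver's code or API. The selection criterion used here (an enzyme is usable only if its recognition site is absent from every part, on both strands) is an assumption made for illustration, and the function names are hypothetical. The recognition sites themselves are the standard ones for these enzymes.

```python
from typing import List

# Type-IIS enzymes in the default order of preference stated in the text,
# with their standard recognition sites.
TYPE_IIS_SITES = [("BsmBI", "CGTCTC"), ("BsaI", "GGTCTC"), ("BbsI", "GAAGAC")]


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def has_site(seq: str, site: str) -> bool:
    """Check whether a recognition site occurs on either strand."""
    seq = seq.upper()
    return site in seq or reverse_complement(site) in seq


def choose_assembly_method(parts: List[str]) -> str:
    """Pick the first enzyme whose site is absent from all parts (assumed
    criterion); fall back to Gibson Assembly if none qualifies."""
    for enzyme, site in TYPE_IIS_SITES:
        if not any(has_site(part, site) for part in parts):
            return f"Golden Gate ({enzyme})"
    return "Gibson"


# Hypothetical parts: the first one contains a BsmBI site.
method = choose_assembly_method(["ATGCGTCTCAAA", "TTTGGGCCC"])
```

Changing the order of `TYPE_IIS_SITES` mirrors the user-adjustable order of preference mentioned in the text.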
LCR Genie 11 is a web-based tool for supporting the design of bridging oligos, which are required for annealing together individual synthetic DNA parts (designed by PartsGenie) into multi-gene plasmid assemblies, designed by OptDoE. Promoters, RBSs and plasmid backbone are either chosen from a shortlist defined within the software (the default behavior) or provided by the user. Enzyme identifiers are randomly chosen and combined with the aforementioned parts to explore the combinatorics of possible constructs. The LCR Genie node provides a wrapper for this functionality, taking in an SBOL document containing numerous combinatorial plasmid assemblies, and designing the bridging oligos necessary for assembly via the ligase cycling reaction method. The LCR Genie node provides functionality analogous to that of the DNA Weaver node (supporting multi-part assembly but by a different experimental method) and, as such, its output format matches that of DNA Weaver.
rpBASICDesign extracts enzyme IDs contained in SBML files (produced by tools such as Selenzyme) to generate genetic constructs compliant with the BASIC assembly approach 12 . The BASIC method relies on orthogonal linkers and type IIS restriction enzyme cleavage to provide a robust and accurate assembly of DNA parts into plasmid constructs. Different types of linkers (neutral, methylated Prefix-Suffix, RBS linkers) can be used for the assembly. Predefined sets of such linkers are commercially available as 96-well plates (e.g., BioLegio plates 13 ). rpBASICDesign takes as input an SBML file annotated with enzyme IDs for each reaction and, optionally, one or several files listing by their IDs the linkers, the promoters, and the backbone to use. If these are not provided, a default list of parts is used. It produces three CSV files and a set of SBOL files. The main CSV file lists the constructs to be built: each construct is described by a row and consists of a sequence of BASIC linker and DNA part IDs. The two other CSV files provide the plate coordinates of the BASIC linkers and of the DNA parts that the user will need to provide. Additionally, one SBOL file is produced for each construct generated. For a given set of enzyme-coding genes provided as standardized DNA parts, several combinations of promoters and RBSs are generated, and permutations of the gene order can optionally be performed in an operon format.
DNA-Bot 14 software takes as input the CSV files describing the constructs (such as those produced by rpBASICDesign) and generates instructions for the automated build of the genetic constructs using Opentrons liquid handling robots. Optional parameters can be set by the user to define the plastic labware to be used, and to set protocol parameters such as washing or incubation times for the purification step. DNA-Bot outputs Python scripts that implement the four assembly steps, namely Clip reactions, Purification, Assembly, and Transformation. In short, the Clip reactions step prepares the mixes for the ligation of the individual DNA parts with the linkers; the Purification step purifies the linker-ligated DNA parts using magnetic beads and the Opentrons magnetic module; the Assembly step mixes the purified DNA parts to build the final constructs; and the Transformation step transforms the chassis microorganism with the plasmid and inoculates it onto agar. Additional metadata useful for keeping track of parameters are also output by the tool.
The Galaxy-SynBioCAD portal does not currently support the visualization of SBOL files such as those produced by PartsGenie and OptDoE; however, these files can be downloaded and visualized using online tools such as VisBOL 15 . Alternatively, they can be uploaded to an SBOL repository and viewed with SBOL visual representation and the newly incorporated sequence viewer.
The Galaxy-SynBioCAD portal also supports other nodes not listed above that perform simple operations such as uploading a file, or extracting the taxonomy ID or the native metabolites from a genome-scale metabolic model (GEM) file (SBML).

Tools design and integration process
The above tools have been put together within a unified and user-friendly interface (Galaxy). As shown in Figure S1, they have been developed as standalone command-line tools without a graphical interface and, to be fully compliant with Galaxy package management, published as Conda packages on Anaconda.org 16 . These packages run, for the most part, on all major Linux distributions, macOS, and Windows and are downloaded from remote channels (conda-forge or bioconda). During the development process, source code changes trigger GitHub Actions, a continuous integration platform that automatically runs the test processes and displays the build status. We also designed and developed wrappers to make the tools available through Galaxy; these wrappers automatically download the previously packaged tools from Anaconda in order to install all the needed dependencies. Each wrapper is deposited in its own GitHub repository, isolated from the core source code of the corresponding tool. The wrappers are tested using Planemo utilities 17 , which help to check the XML validity, to run the tests already written in the test section, and finally to publish all the tools to a test ToolShed 18 and the main ToolShed 19 .

Publishing a tool in the Galaxy ToolShed
The publication of a tool is a three-stage process, as follows. First, the source code is stored in a GitHub repository following a standard organization (see Figure S2). The 'README.md' file presents the purpose of the code, how to install and use it, and how to cite the software. The core source code is stored in a subfolder having the same name as the package. The 'setup.py' file, which includes all the package's metadata and structure, is used to specify how the package should be installed. 'MANIFEST.in' instructs setuptools on which files should be included when an installable project is created. Code test instructions are stored in a 'tests' directory; they are written using the pytest Python library, which simplifies the writing, execution, and reporting of unit tests. The 'LICENSE.txt' file is added to specify the legal terms for the use and the distribution of the code (typically an open-source license such as MIT, GPLv3…). The 'CHANGELOG.md' file lists the history of changes in chronological order. To help in the tool deployment and to simplify its installation, tools are packaged using the Anaconda environment management system 16 . Each package is built by adding a recipe file ('meta.yaml') in the 'recipe' subfolder, which describes the package name, its version, its dependencies, and how to build it (following conda-build guidelines 20 ).
Whenever a code change is committed, automatic tests are applied to the package through GitHub Actions (the GitHub continuous integration platform) and CircleCI, which help to automatically run the build and test processes on Linux, macOS, and Windows environments. The build status is reported on the GitHub branch. Packages are published on Anaconda Cloud in the conda-forge channel (whenever possible) or in the bioconda channel (as a fallback alternative) following instructions we provide on GitHub 21 ('packaging_bioconda.md' and 'packaging_conda_forge.md' files).
Secondly, the Galaxy wrapper is created to describe the inputs, outputs and parameters of the packaged tool in XML format. This wrapper is tested using Planemo which is a command-line utility to assist in building and publishing Galaxy tools. For example, it helps to check the XML validity and to apply the tests that have been previously written in the wrapper's test section.
Once the wrapper has successfully passed the tests, the tool is published in (i) the Test Galaxy Tool Shed 18 , and then in (ii) the Main Galaxy Tool Shed 19 . The tool publication is done following instructions we provide on GitHub 21 ('planemo_test_publish.md' file).
Finally, the Galaxy wrapper of each tool is submitted to the IUC (Intergalactic Utilities Commission) Galaxy community repository 22 . Each submission undergoes a critical review by Galaxy experts. The IUC also provides computational resources for automated testing through GitHub Actions and Tool Shed deployment 23 .
In parallel to these steps, tutorials for the available Galaxy-SynBioCAD workflows are released on the Galaxy Training Platform 24 in a newly created category, 'Synthetic Biology'. As an example, a published tutorial for the Genetic Design (BASIC Assembly) workflow is available 25 .
To support all of these steps and get help, it is worth mentioning Gitter 26 , a chat and networking platform. We recommend using specific Gitter chat rooms for (i) reaching the Galaxy community (Galaxy Training Network 27 , Galaxy Project 28 ) and (ii) reaching the Conda community (conda-forge 29 , Bioconda 30 ).
Concerning the development process, we have followed the FAIR (Findability, Accessibility, Interoperability, Reusability) principles. Thus, (i) source code is freely accessible on public repositories, (ii) each tool is published as a package installable on the main OS platforms (reusable in space) with versioning capabilities (reusable in time), and (iii) standard formats are used as tool inputs and outputs to ensure connections outside the current ecosystem. To go further in FAIR principles, each tool is also available within the Galaxy scientific workflow manager, whose key principles are precisely accessibility, transparency, and reproducibility of workflows. The Galaxy system provides graphical user interfaces to combine different technologies, along with efficient methods for using, sharing, and publishing them, thus increasing the efficiency of the scientists using them.

Retrosynthesis workflow
The first workflow, illustrated in Figure S3, performs retrosynthesis and pathway enumeration. The workflow takes as input: (i) the International Chemical Identifier (InChI) of the compound of interest to produce, (ii) the SBML genome-scale model of the chosen chassis organism, and (iii) the reaction rules (generated by the RRules Parser node, which calls RetroRules).

Figure S3. RetroSynthesis and Pathway Enumeration workflow. (A) The workflow of tools for retrosynthesis and pathway enumeration. Tools can be chained manually by running each tool one after the other in a command-line terminal. Outputs of each tool can be directly given as inputs of the others without any other processing. (B) The workflow menu at runtime in the Galaxy interface. The user specifies the genome scale SBML model of the host organism and the InChI structure of the target molecule. The user can also change the default settings for each tool by clicking on its name. The RetroRules entry has been set as default for convenience. The workflow generates a collection of heterologous pathways for target production in separate SBML files. (C) The workflow as displayed in the Galaxy workflow Editor.
The workflow generates theoretically possible pathways for the production of a target molecule in a chosen organism. Three key steps are performed in this workflow; these are detailed in the Methods section (cf. Reaction rules, Retrosynthesis from target to sink, Pathway annotation) and summarized next. First, using RetroPath2.0, the workflow generates a network of feasible metabolic routes to produce a target molecule in a selected chassis organism. That metabolic network is then decomposed into individual pathways using RP2paths. Lastly, rpCompletion takes those individual metabolic pathways and filters them (duplicated pathways are removed), then splits them into sub-pathways by adding the appropriate cofactors, and finally converts them to SBML files. Additional details are provided in the Methods section (cf. Pathway completion combinatorics). Figure S4 is a screenshot of the Retrosynthesis workflow as it appears in Galaxy.

Figure S4. Screen capture of the Retrosynthesis workflow.
The workflow configuration panel highlights the selection of the chassis model from a predefined list. Users can also import models in SBML file format.

Pathway analysis workflow
This workflow ranks a set of pathways based on multiple metrics (flux balance analysis, thermodynamics, pathway length, and reaction SMARTS, cf. Figure S5). The workflow takes as inputs: (i) the list of pathways to rank, and (ii) the structure of metabolites present in the chosen chassis organism (e.g., E. coli model iML1515). Given a set of pathways generated by the Retrosynthesis workflow, the Pathway analysis workflow informs the user which pathways are theoretically the best performing, according to various criteria calculated by the rpScore tool. The criteria used for scoring are listed below.
1. In the Retrosynthesis workflow, molecules contained within a full SBML model are used to compute heterologous pathways. As a result, the calculated heterologous pathways can easily be merged into the full organism model, enabling the whole-cell context to be used to calculate the production flux of a given target. The method forces a fraction of the biomass reaction's maximal flux through that reaction while optimizing for the target molecule. This is achieved through the FBA node. The FBA node is further described in Methods (cf. Flux Balance Analysis with Fraction of Reaction).
2. Thermodynamics values (based on Gibbs free energies) are computed for each pathway by using a linear equation system solver (cf. Thermodynamics in Methods) to optimize the yield of the reaction producing the target and to remove intermediate compounds so that they do not clutter up the cell.
3. Enzyme availability for the chemical transformation is also taken into consideration, where high values favor less promiscuous reaction rules and express better confidence. The method used to compute the enzyme availability score is described in the Methods section (cf. Retrosynthesis from target to chassis).
4. Finally, the length of the pathway is taken into consideration; shorter pathways are favored over longer ones.
Lastly, the above metrics are given to a machine learning model (cf. Machine Learning Global Scoring in Methods) in prediction mode to provide a single global score per pathway. The results may be graphically inspected by the user using a Galaxy embedded visualizer (rpViz node). The visualizer displays the heterologous metabolic routes, with complete descriptions of the chemical species, reactions, and pathways. Figure S6 presents displays from rpViz and rpReport for the production of lycopene in E. coli (model iML1515).
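The flux criterion listed first among the scoring criteria (forcing a fraction of the maximal biomass flux while optimizing the target) can be written as a two-step linear program. The formulation below is a hedged sketch of that idea, not the exact method of the Methods section: the symbols S (stoichiometric matrix), v (flux vector), the bounds, and the fraction f are notation introduced here, and fixing the biomass flux by an equality rather than a lower bound is an assumption.

```latex
\begin{aligned}
\text{Step 1:}\quad & v_{\mathrm{bio}}^{\max} = \max_{v}\; v_{\mathrm{bio}}
  && \text{s.t. } S v = 0,\quad v^{\mathrm{lb}} \le v \le v^{\mathrm{ub}} \\
\text{Step 2:}\quad & \max_{v}\; v_{\mathrm{target}}
  && \text{s.t. } S v = 0,\quad v^{\mathrm{lb}} \le v \le v^{\mathrm{ub}},\quad
  v_{\mathrm{bio}} = f \, v_{\mathrm{bio}}^{\max},\quad 0 \le f \le 1
\end{aligned}
```

Step 1 is a standard FBA on the merged model; Step 2 reuses its optimum to reserve part of the cell's capacity for growth before maximizing target production.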

Genetic design and engineering workflows
The previous workflow (Pathway Analysis) provides a list of ranked metabolic pathways producing a molecule of interest within a selected chassis organism. The next step is to engineer all or some of these pathways. Pathway selection can be performed in a fully automatic way, for instance by retaining the top-ranked pathway. Reviewing the pathways using rpViz is also a good option, as it lets experts browse the predictions and select the best implementations. Once pathways have been selected, we provide two genetic design workflows for different assembly protocols.
The first workflow provides assembly plans using three different techniques: Golden Gate 31 , Gibson 32 , and ligase cycling reaction (LCR) 33 . The workflow offers the possibility of using the OptDoE node for combinatorial experimental design. We also provide a second workflow using the Biopart Assembly Standard for Idempotent Cloning (BASIC) technique 34 . This workflow takes as input a pathway (in SBML format) and generates a script to operate an Opentrons liquid handler robot, which performs assembly and chassis transformation.

Golden Gate, Gibson and LCR assembly workflow
This workflow, illustrated in Figure S7, encodes the top-ranking predicted pathways from the Pathway Analysis workflow into plasmids intended to be expressed in the specified organism. First, the Selenzyme node is executed to return a user-defined number of UniProt IDs associated with each reaction. Then a maximum number of pathways, defined by the user, are converted to an SBOL file. The next tool, PartsGenie, retrieves the DNA sequences of the predicted enzymes based on their UniProt IDs, performs a codon optimization, and creates a first-level library from them by adding, upstream of each CDS, an RBS of specified calculated strength; these sequences can be output for direct gene synthesis. These parts are then used by OptDoE to design a library of plasmids of defined size, expressing at various levels the genes coding for the multiple enzymes present in the predicted pathways. The other genetic parts required by this software (origins of replication, promoters, terminators, and markers) are taken either from a default list or from a specific list provided by the user, which needs to refer to parts stored in SynBioHub. The Galaxy tool "OptDoE Parts Reference Generator" has been written for that purpose. This final genetic design library is generated in SBOL format and can then be used as input to other software or visualized using tools implementing the SBOL visual standard. The workflow ends with two different tools tackling the library construction problem: LCR Genie, which proposes an assembly strategy using the ligase cycling reaction method, and DNA Weaver, which calculates the optimal synthesis plan and the assembly protocol following either a Golden Gate or a Gibson Assembly method. The output of LCR Genie or DNA Weaver is an Excel file containing the full sequences of the plasmid library and of the intermediate parts required to construct them.

BASIC assembly workflow
The Galaxy workflow is depicted in Figure S8.A and further illustrated in the Methods section (cf. Basic Design and DNA-BOT workflow execution). First, a pathway generated by either the RetroSynthesis or the Pathway Analysis workflow is provided as an SBML file to the Selenzyme tool. Selenzyme searches for enzymes corresponding to each reaction of the pathway, and outputs an updated SBML file annotated with the enzyme UniProt IDs. To restrict the enzyme search to only a subpart of the tree of life (e.g., only enterobacteria), a list of taxonomic IDs can be provided. Second, the BasicDesign tool converts the SBML file into CSV files describing the DNA parts to be included in each construct (in an operon format). The number of constructs varies depending on the number of enzymes per reaction, the number of RBSs and promoters available, and whether or not CDS permutation is performed within the operon. In the last step, the DNA-Bot tool reads the list of constructs and the positions of the DNA parts on the source plates and generates a set of Python scripts to build the plasmids using an Opentrons liquid handling robot. After downloading these scripts onto a computer connected to an Opentrons, the user can perform the automated construction of the plasmids at the bench.
Figure S8. BASIC assembly workflow.
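How the number of BASIC constructs depends on the available enzymes, promoters, RBSs, and on CDS permutation can be illustrated with a small Python sketch. It assumes a full factorial enumeration, which may differ from what the design tool actually does (it could sample or cap the library); all part IDs and the function `enumerate_constructs` are hypothetical.

```python
from itertools import permutations, product
from typing import List, Sequence, Tuple


def enumerate_constructs(
    enzymes_per_reaction: Sequence[Sequence[str]],
    promoters: Sequence[str],
    rbs_linkers: Sequence[str],
    permute_cds: bool = False,
) -> List[Tuple[str, ...]]:
    """List operon constructs as (promoter, RBS, CDS, RBS, CDS, ...) tuples."""
    constructs = []
    for promoter in promoters:
        # Pick one enzyme variant per reaction of the pathway.
        for enzyme_choice in product(*enzymes_per_reaction):
            orders = permutations(enzyme_choice) if permute_cds else [enzyme_choice]
            for order in orders:
                # Assign one RBS linker to each CDS in the operon.
                for rbs_choice in product(rbs_linkers, repeat=len(order)):
                    operon = [promoter]
                    for rbs, cds in zip(rbs_choice, order):
                        operon += [rbs, cds]
                    constructs.append(tuple(operon))
    return constructs


# Hypothetical two-reaction pathway with two candidate enzymes for step 1.
enzymes = [["E1a", "E1b"], ["E2"]]
fixed_order = enumerate_constructs(enzymes, ["P1", "P2"], ["RBS1", "RBS2"])
permuted = enumerate_constructs(enzymes, ["P1", "P2"], ["RBS1", "RBS2"], permute_cds=True)
```

In this toy case the library size is 2 promoters × 2 enzyme choices × 2² RBS assignments, and allowing gene-order permutation multiplies it by the number of CDS orderings.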

Literature Pathways matching algorithm
An algorithm was designed to quantify the degree of similarity between a true pathway (from the literature) and a list of predicted pathways generated by Galaxy-SynBioCAD. To test the algorithm, we compiled 77 pathways from the literature (cf. Supplementary file 'Dataset 2') and used the Retrosynthesis and Pathway Analysis workflows to generate predicted pathways for the same targets and chassis organisms. Since extracting information from journal articles can be difficult and reports are commonly incomplete, the algorithm reports with what degree of confidence a predicted pathway matches a true one. Note that the predicted pathway contains all the necessary information, while the true pathway may contain only partial information. Pathways are first compared at the reaction level (all reactions from the one that generates the target to the one that is linked to the chassis organism), then at the pathway level.

Reaction matching score
To compare two reactions r_true and r_pred, all the reactants of the true reaction are compared with all the reactants of the predicted reaction; the same comparison is performed for the products. When several reactants (products) are present in r_true, one searches r_pred for the most similar reactant (product). The similarity between a given true reactant (product) and a predicted one is computed from their Morgan fingerprints, obtained with the RDKit library, as a Jaccard coefficient (also named Tanimoto coefficient). The chemical reaction score RCH is the average of the Jaccard coefficients computed for all the reactants and products of r_true.
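The chemical reaction score described above can be sketched in plain Python. To keep the example self-contained, fingerprints are represented as sets of on-bit indices rather than computed with RDKit, and the example bit sets are made up. The function names and the convention that two empty fingerprints have similarity 1 are assumptions for illustration.

```python
from typing import FrozenSet, Iterable, List

# A fingerprint is modelled as the set of indices of its "on" bits.
Fingerprint = FrozenSet[int]


def jaccard(a: Fingerprint, b: Fingerprint) -> float:
    """Jaccard (Tanimoto) coefficient between two bit sets."""
    if not a and not b:
        return 1.0  # assumed convention for two empty fingerprints
    return len(a & b) / len(a | b)


def best_match(true_fp: Fingerprint, pred_fps: Iterable[Fingerprint]) -> float:
    """Similarity of a true compound to its most similar predicted compound."""
    return max((jaccard(true_fp, fp) for fp in pred_fps), default=0.0)


def chemical_reaction_score(
    true_reactants: List[Fingerprint],
    true_products: List[Fingerprint],
    pred_reactants: List[Fingerprint],
    pred_products: List[Fingerprint],
) -> float:
    """Average best-match similarity over all reactants and products of the true reaction."""
    scores = [best_match(fp, pred_reactants) for fp in true_reactants]
    scores += [best_match(fp, pred_products) for fp in true_products]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up fingerprints: the reactant matches exactly, the product only partly.
rch = chemical_reaction_score(
    true_reactants=[frozenset({1, 2, 3})],
    true_products=[frozenset({4, 5})],
    pred_reactants=[frozenset({1, 2, 3}), frozenset({9})],
    pred_products=[frozenset({4, 5, 6, 7})],
)
```

Note that the average runs over the compounds of the true reaction only, so extra compounds in the predicted reaction (for example, cofactors absent from the literature report) do not lower the score.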
The second criterion for matching reactions (REC score) is based on the EC numbers of the true and predicted reactions, up to the fourth level. If the two reactions have the same first-level EC digit, 1/2 is added to REC; if the reactions also match at the second level, 1/4 is added; 3/16 is added if they match at the third level, and 1/16 at the fourth level. Like RCH, REC ranges from 0 to 1.
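The EC-based score can be sketched as follows, using the level weights given above. The function name `ec_score` is hypothetical, and two choices are assumptions: the comparison stops at the first level that differs, and an unspecified level written as "-" is treated as a non-match.

```python
# Weights for EC levels 1 to 4, as stated in the text; they sum to 1.
EC_LEVEL_WEIGHTS = (1 / 2, 1 / 4, 3 / 16, 1 / 16)


def ec_score(ec_true: str, ec_pred: str) -> float:
    """Score the agreement between two EC numbers, level by level."""
    score = 0.0
    levels = zip(EC_LEVEL_WEIGHTS, ec_true.split("."), ec_pred.split("."))
    for weight, true_level, pred_level in levels:
        # Assumption: stop at the first mismatch or unspecified ("-") level.
        if true_level != pred_level or true_level == "-":
            break
        score += weight
    return score


full_match = ec_score("1.1.1.1", "1.1.1.1")
third_level_mismatch = ec_score("1.1.1.1", "1.1.2.1")
```

Because the weights decrease with depth, agreement on the enzyme class counts for far more than agreement on the final serial number.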
Lastly, the two scores are combined with a weighted mean:
Similarly to the species match, a measured reaction can match several predicted ones; the matches are therefore computed in a matrix and the algorithm selects the best one.

Pathway matching score
Because predicted pathways can differ in length from the measured one, we define a pathway length penalty score:
Obviously h = 1.0 when the two pathways (true, pred) have the same number of steps. The penalty is applied to the sum of the reaction match scores, giving the final pathway match score:
where is the mean score over all the m reactions in the pathway.

Matching Threshold
To determine a matching score threshold, above which a predicted pathway can be considered identical to a true pathway, we collected for each literature pathway the predicted pathway having the best matching score using eq. (3). Pathways can differ because they use different enzymatic reactions with different substrates and products. While the predicted pathways generated by Galaxy-SynBioCAD contain all the substrates and products of the reactions, cofactors (cosubstrates and coproducts) are generally not reported in the literature, and the matching score can be lowered because of these missing cofactors. To verify that the main substrate and main product of a given true pathway are accounted for in a predicted pathway, we compute a similarity (using the Jaccard coefficient) between the two pathways after removing cofactors. Precisely, for each reaction of the true pathway, a chemical reaction score (RCH as defined in eq. (1)) is computed without taking cofactors into account; in other words, one searches the predicted pathway for the substrate (product) most similar to the one found in the true reaction. The similarity between the true and predicted pathways is the average RCH value over the true reactions. A Jaccard coefficient of 1 indicates that the main substrates and products for all reactions of a literature pathway are retrieved in the predicted pathway.
One observes from Figure 2.D (main text) that when the matching score is above 0.5 the predicted reactions are nearly identical to the literature reactions (with a pathway similarity of 1). Consequently, any pathway generated by Galaxy-SynBioCAD is labeled 'literature pathway' if its score is above 0.5. Figure S9 gives the distribution of matching scores obtained for all generated pathways; about 20% of them pass the threshold.
Figure S9: Score distribution of all predicted pathways.
The total number of pathways generated is 5874, 1222 (~20%) of which have a score above 0.5.
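The labeling rule and the reported proportion can be checked with a short Python sketch. The threshold of 0.5 and the counts (1222 of 5874) come from the text; the function names and the example scores are hypothetical.

```python
from typing import Sequence

MATCH_THRESHOLD = 0.5  # threshold stated in the text


def is_literature_pathway(score: float, threshold: float = MATCH_THRESHOLD) -> bool:
    """Label a predicted pathway as a 'literature pathway' if its score exceeds the threshold."""
    return score > threshold


def fraction_above(scores: Sequence[float], threshold: float = MATCH_THRESHOLD) -> float:
    """Fraction of pathways whose matching score is strictly above the threshold."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s > threshold) / len(scores)


# Made-up scores for illustration; a score equal to the threshold does not pass.
example_fraction = fraction_above([0.2, 0.5, 0.51, 0.9])

# Proportion reported for Figure S9: 1222 pathways out of 5874.
reported_fraction = 1222 / 5874
```

The reported counts correspond to a fraction of roughly 0.21, consistent with the "about 20%" quoted for Figure S9.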